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Abstract HIV-1 sequence diversity is affected by selection pressures arising from host genomic 
factors. Using paired human and viral data from 1071 individuals, we ran >3000 genome-wide scans, 
testing for associations between host DNA polymorphisms, HIV-1 sequence variation and plasma 
viral load (VL), while considering human and viral population structure. We observed significant 
human SNP associations to a total of 48 HIV-1 amino acid variants (p<2.4 x 10~ 12 ). All associated 
SNPs mapped to the HLA class I region. Clinical relevance of host and pathogen variation was 
assessed using VL results. We identified two critical advantages to the use of viral variation for 
identifying host factors: (1) association signals are much stronger for HIV-1 sequence variants than 
VL, reflecting the 'intermediate phenotype' nature of viral variation; (2) association testing can be 
run without any clinical data. The proposed genome-to-genome approach highlights sites of 
genomic conflict and is a strategy generally applicable to studies of host-pathogen interaction. 
DOI: 10.7554/eLife.01 123.001 



eLife digest Developing treatments or vaccines for HIV is challenging because the genetic 
makeup of the virus is constantly changing in an effort to outwit the human immune system. 
Moreover, the immune system is highly variable as a result of the long-standing co-evolution of 
humans and microbes. Each individual will try to oppose the invading virus in a unique way, forcing 
the virus to acquire specific mutations that can be interpreted as the genetic signature of this 
one-against-one battle. 

To explore the influence of co-evolution on HIV, Bartha et al. took samples of both human and 
viral genomes from 1071 individuals infected with HIV, the AIDS virus, and used genotyping and 
sequencing technology to obtain a comprehensive description of the genetic variation in both. 
Computational techniques were then used to search for links between variants in the human DNA 
sequences and variants in the viral sequences. 

The most common type of genetic variation found in the human genome is a single nucleotide 
polymorphism, or SNP for short: a SNP is produced when a single nucleotide - an A, C, G or T - is 
replaced by a different nucleotide. Bartha et al. found that SNPs within the human DNA sequences in 
their study were linked to variations in 48 amino acids in HIV. Moreover, all these SNPs were found within 
a group of genes known as the HLA (human leukocyte antigen) system, which encodes for proteins 
that play a vital role in the immune response. This work identified the areas of the human genome 
that put pressure on the AIDS virus, and the regions of the virus that serve to escape human control. 

The approach developed by Bartha et al. allows the interactions between a microbe and a human 
host to be studied by looking at the genome of the microbe and the genome of the infected 
person. It also differentiates host-induced mutations that limit the capacity of the virus to do harm 
from those that are tolerated by the pathogen. A similar strategy could be used to study other 
infectious diseases. 
DOI: 10.7554/eLife.01 123.002 



Introduction 

Through multiple rounds of selection and escape, host and pathogen genomes are imprinted with 
signatures of co-evolution that are governed by Darwinian forces. On the host side, well-characterized 
anti-retroviral restriction factors, such as TRIM5a, APOBEC3G and BST2, harbor strong signals of 
selection in primate genomes, clear examples of retroviral pressure {Ortiz et al., 2009). On the virus 
side, obvious signs of selection are observable in the HIV-1 genome: escape mutations and reversions 
have been described in epitopes restricted by human leukocyte antigen (HLA) class I molecules and 
targeted by cytotoxic T lymphocyte (CTL) responses (Goulder et al., 2001; Kawashima et al., 2009). 
Sequence polymorphisms have also been reported recently in regions targeted by killer immunoglobulin- 
like receptors (KIR), suggesting evasion from immune pressure by natural killer (NK) cells (Alter et al., 
2011). Evidence for the remodeling of retroviral genomes by host genetic pressure also comes from 
simian immunodeficiency virus (SIV) infection studies in rhesus macaques, where escape from restric- 
tive TRIM5ct alleles has been observed in the viral capsid upon cross-species transmission of SIVsm 
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(Kirmaier et al., 2010). In contrast, human alleles of TRlM5ct do not result in escape mutations, likely 
because of adaptation of the pathogen to the host (Rahm et al., 2013). Sequence adaptation is also 
a known feature of cross-species transmission. For example, a methionine in the matrix protein (Gag-30) 
in SIV cpzPtt changed to arginine in lineages leading to HIV-1 and reverted to methionine when HIV-1 was 
passaged through chimpanzees (Wain et al., 2007). 

To date, combined analyses of human and HIV-1 genetic data have addressed the association of 
HLA and KIR genes with variants in the retroviral genome {Moore et al., 2002; Brumme et al., 2007; 
Bhattacharya et al., 2007; Kawashima et al., 2009; Alter et al., 2011; Carlson et al., 2012; Wright 
et al., 2012). Additionally, genome-wide association studies (GWAS) performed in the host have 
focused on various HIV-related clinical phenotypes {Fellay et al., 2007; Fellay et al., 2009; Pereyra 
et al., 2010). In parallel, large amounts of HIV-1 sequence data have been generated for phylogenetic 
studies, which shed new light on viral transmission and evolution (Kouyos et al., 2010; Alizon et al., 
2010; Von Wyl et al., 2011), or allow clinically driven analyses of viral genes targeted by antiretroviral 
drugs (resistance testing) (Von Wyl et al., 2009). 

Building on the unprecedented possibility to acquire and combine paired human and viral genomic 
information from the same infected individuals; we employ an innovative strategy for global genome-to- 
genome host-pathogen analysis. By simultaneously testing for associations between genome-wide human 
variation, HIV-1 sequence diversity, and plasma viral load (VL), our approach allows the mapping of all sites 
of host-pathogen genomic interaction, the correction for both host and viral population stratification, 
and the assessment of the respective impact of human and HIV-1 variation on a clinical outcome {Figure 1). 

Results 

Study participants, host genotypes, and HIV-1 sequence variation 

Full-length HIV-1 genome sequence and human genome-wide SNP data were obtained from seven 
studies or institutions on a total of 1071 antiretroviral naive patients of Western European ancestry, 
infected with HIV-1 subtype B. The homogeneity of the study population was confirmed by principal 
component analysis of the genotype matrix: together, the first five principal components explained 1% 
of total genotypic variation. After quality control of the human genotype data, imputation and filtering, 
~7 million SNPs were available for association testing. The full-length HIV-1 sequence is approximately 
9.5 Kb long, corresponding to over 3000 encoded amino acids. Not all sequences were complete; on 
an average, viral residues were covered in 85% of the study population (range: 75% in Tat to 95% in 
Gag). Due to its hypervariable nature, the portion of the HIV-1 envelope gene that encodes the gp1 20 
protein was not sequenced in most study samples and was therefore excluded. Overall 1 1 26 residues 
of the HIV-1 proteome were found to be variable in at least 10 samples, for a total of 3381 different 
viral amino acids that could be represented by 3007 distinct binary variables. 

Host VL GWAS 

We first performed a classical GWAS of host 
determinants of HIV-1 VL {Figure 2A, Study A ) 
using data from 698 patients (65% of the study 
population) for whom a VL phenotype could be 
reliably estimated. The top associations were 
observed in the HLA class I region on chromo- 
some 6 and were highly consistent with results 
observed previously (Fellay et al., 2009; Pereyra 
et al., 2010). The strongest associated SNP, 
rs9267454 (p = 1.5 x 10 8 ), is in partial linkage dis- 
equilibrium (LD) with HLA-B*57:01 (r 2 = 0.47, D' = 
0.92), HLA-B*14:01 (r 2 = 0.12, D' = 1.0), HLA- 
B*27:05 (r 2 = 0.01, D' = 0.99), and the HLA-C -35 
rs9264942 SNP (r 2 = 0.07, D' = 0.77), and thus 
reflects these well-known associations with HIV-1 
control. These results confirm the quality of the 
study population for the purpose of genome anal- 
ysis of determinants of HIV-1 -related outcomes. 
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Figure 1 . A triangle of association testing. The 
following association analyses were performed: [Study 

A] human SNPs vs plasma viral load (1 GWAS); [Study 

B] human SNPs vs variable HIV-1 amino acids (3007 
GWAS); and [Study C] variable HIV-1 amino acids vs 
plasma viral load (1 proteome-wide association study). 
DOI: 10.7554/eLife.01 123.003 
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Figure 2. Results of the genome-wide association analyses. (A) Associations between human SNPs and HIV-1 plasma viral load. The dotted line shows the 
Bonferroni-corrected significance threshold (p-value < 7.25 x 10"'). (B) Associations between human SNPs and HIV-1 amino acid variants, with 3007 GWAS 
collapsed in a single Manhattan plot. The dotted line shows the Bonferroni-corrected significance threshold (p-value < 2.4 x 10~ 12 ). (C) Schematic representation 
Figure 2. Continued on next page 
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Figure 2. Continued 

of the HLA class I genes and of the SNPs associated with HIV-1 amino acid variants in the region. (D) Same association results as in panel B, projected on the 
HIV-1 proteome. Only the strongest association is shown for each amino acid. Significant associations are indicated by a blue dot. The gp120 part of the HIV-1 
proteome was not tested. The colored bar below the plot area shows the positions of the optimally defined CD8+ T cell epitopes. An interactive version of this 
figure can be found at http://g2g.labtelenti.org (which is also available to download from Zenodo, http://dx.doi.org/10.5281/zenodo.7138). 

DOI: 10.7554/el_ife.01 123.004 



Genome-to-genome analyses 

3007 genome-wide analyses of associations between human SNPs and HIV-1 amino acid variants were 
performed in the full sample of 1071 individuals {Figure 2B, Study B ) using logistic regression cor- 
rected for viral phylogeny {Carlson et al., 2008; Carlson et al., 2012). Highly significant associations 
were observed between SNPs in the major histocompatibility complex (MHC) region and multiple 
amino acids throughout the HIV-1 proteome (except in Vpu, Rev and the RNaseH subunit of RT) 
{Figure 2Q, with Gag and Nef having a significantly higher density of associated variable sites than the 
rest of the proteome (Gag: 6.8% vs 2.6% p=0.001; Nef: 11% vs 2.6% p = 1.2 x 10 5 , binomial tests). 
Using Bonferroni correction for multiple testing (threshold p = 2.4 x 10~ 12 ), significant human SNP 
associations were observed with 48 viral amino acids {Figure 2 and Table 1). None of these 48 amino 
acids mapped to known sites of major antiretroviral drug resistance mutations {Hirsch et a/., 2008). 
The strongest association found was between rs72845950 and Nef position 135 (p = 2.7 x 10~ 66 ). 
Associations were much stronger between human SNPs and HIV-1 amino acids than with VL. For 
example, the SNP rs2395029, a proxy for HLA-B*57:01 (r 2 = 0.93), has a p-value of 1 .21 x 1 0" 6 for 
association with VL, while it reaches a p-value of 4 x 10~ 59 for association with amino acid variation 
in Gag at position 242 (a well known position of escape from HLA-B*57:01). No significant signals 
were identified outside the MHC. A link to the complete set of association results can be found at 
http://g2g.labtelenti.org (which is also available to download from Zenodo, http://dx.doi.org/10.5281/ 
zenodo. 71 39). These results demonstrate the feasibility and improved power of performing association 
testing using viral genetic variation as outcome, independent of clinical phenotype. 

SNPs, HLA alleles and CTL epitopes 

We next assessed whether the top SNPs associated with HIV-1 amino acids represent indirect markers 
of HLA class I alleles known to exert evolutionary pressure on HIV-1 {Table 1). We tested pairwise 
correlations between significant MHC SNPs and HLA class I alleles. The analysis confirmed the existence 
of high LD between SNPs and HLA alleles targeting corresponding epitopes. For example, the 
strongest association (p = 2.7 x 10" 66 ) was observed between residue 135 in Nef, located in an optimally 
defined A*24:02 epitope, and rs72845950, which strongly tags HLA-A*24:02 (r 2 = 0.89). Furthermore, 
we observed that a substantial fraction of the identified viral amino acids (24/48, 50%) were located 
within an optimally defined CTL epitope restricted by one or more HLA alleles tagged by the associ- 
ated SNP (http://www.hiv.lanl.gov/content/immunology/tables/optimal_ctl_summary.html, supple- 
mented with a recently updated list of epitopes [Carlson et al., 2012]). However, in seven cases, 
the classical HLA allele implicated through LD with a tagging SNP did not match previously reported 
restriction patterns {Table 1). These data demonstrate that this approach can reconstruct a map of 
targets of HLA pressure across the viral proteome and identify sites outside classical epitopes that 
could represent additional escape variants or compensatory mutations. That a substantial proportion 
of associated viral amino acids lay outside known CTL epitopes also highlights this approach as a tool 
to guide novel epitope discovery {Bhattacharya et al., 2007; Almeida et al., 2011). 

Analysis of polymorphic amino acids within the HLA genes has been shown to improve power for 
detection of association with clinical outcome and has demonstrated the biological relevance of key 
residues in the HLA-B binding groove {Pereyra et al., 2010). Therefore, we used the genome-to- 
genome framework to characterize the evolutionary pressure of HLA class I amino acids on the viral 
genome. The top associations in all classical class I genes mapped to discrete residues in the binding 
grooves of the HLA molecule: HLA- A position 62 (p = 3.3 x 10 76 with HIV Nef 135), HLA-B position 70 
(p = 7.1 x 10- 57 with HIV Gag 242), HLA-C position 99 (p = 5.4 x 10 63 with HIV Nef 70). These data 
indicate that all class I HLA genes can exert strong pressure on the viral proteome through a shared 
mechanism. The association results for HLA amino acids can also be found at http://g2g.labtelenti.org 
(which is also available to download from Zenodo). 
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HIV-1 amino acids vs plasma viral load 

To address whether there was an observable impact of viral mutation on a clinical outcome in this 
sample, we tested for associations between all HIV-1 amino acid variant and VL (Study C). After correc- 
tion for multiple testing (p threshold = 1 .6 x 10~ 5 based on 3125 viral amino acids), we did not observe 
any significant associations. We then focused on estimating the changes in VL associated with the 48 
HIV-1 amino acid variants that were identified as significantly associated with host SNPs (Figure 2D). 
The effects of amino acid variation at these sites on VL ranged from -0.16 to +0.07 log 10 copies/ml 
{Figure 3A). We also explored the combined fitness effect of multiple HIV-1 viral amino acid variants 
targeted by a single host marker using the well-understood model of HLA-B*57:01. We evaluated the 
effect on VL of 23 viral residues that associated with host variant rs2395029 (r 2 = 0.93 with 
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Figure 3. Association of HIV-1 amino acid variants with plasma viral load. (A) Changes in VL (slope coefficients from the univariate regression model and 
standard error, log 10 copies/ml) for the 48 HIV-1 amino acids that are associated with host SNPs in the genome-to-genome analysis. (B) rs2395029, a 
marker of HLA-B*57:01 is associated with a 0.38 log 10 copies/ml lower VL (black bar) in comparison to the population mean. Gray bars represent changes 
in VL for amino acid variants associated with rs2395029 (p<0.001). In case of multiallelic positions, the change in VL is shown for all minor amino acids 
combined vs the major amino acid (e.g., GAG147 not I). 
DOI: 10.7554/eLife.01 123.006 
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HLA-B*57:01) in the genome-to-genome analysis (selection cutoff: p<0.001). The marker rs2395029 
was associated with a 0.38 log decrease in viral RNA copies/ml. The univariate effect on VL for each of 
the 23 viral amino acids targeted by this allele ranged from -0.16 to +0.12 {Figure 3B). These results 
suggest that the genome-to-genome approach can be linked to clinical/laboratory phenotypes, 
allowing for detailed understanding of the distribution and relative contribution of sites of host- 
pathogen interaction to disease outcome. 

Discussion 

HIV-1 host genomic studies performed so far have focused on clinically defined outcomes (resistance 
to infection, clinical presentation, disease progression or death) or on pathogen-related laboratory 
results (such as CD4+ T cell counts and VL set point). While useful, these phenotypes have significant 
drawbacks. First, consistency of phenotypic determination can be hard to achieve, and such inconsist- 
ency can adversely affect power in large-scale genetic studies performed across multiple centers 
(Evangelou eta/., 2011). Second, a relatively long follow-up in the absence of antiretroviral treatment 
is necessary to obtain informative data about the natural history of infection. However, international 
guidelines now propose an early start of antiretroviral therapy in most HIV-1 infected individuals 
(Thompson et a/., 2012), making the collection of large numbers of long-term untreated patients not 
only unrealistic but also ethically questionable. 

To overcome these limitations, we developed a novel approach for host genetic studies of infec- 
tious diseases, built on the unprecedented possibility to obtain paired genome-wide information from 
hosts and pathogens. We combined human polymorphism and HIV-1 sequence diversity in the same 
analytical framework to search for sites of human-virus genomic conflict, effectively using variation in 
HIV-1 amino acids as an 'intermediate phenotype' for association studies. Intermediate phenotypes 
have recently been shown to be useful in uncovering association signals that are not detectable using 
more complex clinical endpoints: illustrative examples include metabolomic biomarkers in cardiovascular 
research (Suhre et a/., 2011), serum IgE concentration in the study of asthma (Moffatt et a/., 2010), 
or neuroimaging-based phenotypes in psychiatry genetics {Rasetti and Weinberger, 2011). Variation 
in the pathogen sequence is an as-yet-untapped intermediate phenotype, specific by nature to 
genomic research in infectious diseases. Importantly, it depends on sequencing the pathogen, 
which could prove in many cases easier and more standardized than obtaining detailed clinical 
phenotypes. 

Our approach allowed the mapping of host genetic pressure on the HIV-1 genome. The strongest 
association signals genome-wide were observed between human SNPs tagging HLA class I alleles 
and viral mutations in their corresponding CTL epitopes. Additional association signals were observed 
outside of optimally defined CTL epitopes, which could indicate novel epitopes, or represent 
secondary (compensatory) mutations. In a single experiment, these results recapitulate extensive 
epidemiological and immunogenetic research and represent a proof-of-concept that biologically 
meaningful association signals are identifiable using a hypothesis-free strategy. Indeed, host factors 
leading to viral adaptation can be uncovered by searching for associated imprints in the viral genome. 
Of note, the International HIV Controllers Study demonstrated the importance of specific amino acid 
positions in the HLA-B binding groove on a clinical outcome (elite control) {Pereyra et a/., 2010). We 
here extend this observation to the HLA-A and C grooves, emphasizing the similarity in mechanism 
of host pressure on the viral proteome that is not necessarily translated into observable clinical 
outcomes. 

We found a higher density of amino acid positions under selection in Gag and Nef compared with 
the rest of the HIV proteome. This is consistent with earlier findings that indicate the importance of 
Gag p24-specific CTL responses in slower progression to AIDS {Borghans et a/., 2007; Brennan et a/., 
2012) or controller status {Dyer et a/., 2008). Moreover, this further demonstrates that mapping host 
pressure on the pathogen proteome can reveal biologically relevant effects. 

Analyses were performed using samples from clinically well-characterized patients, most of them 
with repeated and reliable HIV-1 VL measurements in the absence of antiretroviral therapy. We were 
thus able to compare the results of GWAS assessing human genetic determinants of mean VL, 
a standard clinical correlate of HIV-1 control, and genome-to-genome GWAS on amino acid variants 
in the viral proteome. The use of HIV-1 variation as outcome resulted in a considerable gain in power 
to detect host factors: the lowest p-values were observed for SNPs mapping to the HLA class I 
region in both approaches, but associations were much stronger with HIV-1 amino acid variation 
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than for HIV-1 VL (2.7 x 10" 66 vs 1 x 10" a ), even when accounting for the increased number of multiple 
tests. 

In addition to identifying sites of interaction between the host and the pathogen, the study design 
allowed the scoring of biological consequences of such interaction, by assessing associations between 
host-driven escape at viral sites and an in vivo phenotype (VL). For example, we decomposed the 
effect of rs2395029 (a marker of HLA-B*57:01 ) on VL to the effects of the multiple viral amino acid vari- 
ants that are associated with that SNP. While some HIV-1 amino acid changes individually associate 
with decrease in VL, the compound image that emerges is one of a multiplicity of modest effects 
distributed across many residues. Correlations between host-associated variants and VL are difficult 
to interpret, because they may reflect fitness costs or compensation, the existence of strong (/versen 
et al., 2006; Carlson et al.,, 2012) or novel (Almeida et a/., 2011) immune responses, or the indirect 
impact of specific HLA class I alleles. Nevertheless, the observation that the majority of host-associated 
HIV-1 mutations do not correlate with any detectable change in VL confirms HIV's remarkable capacity 
to adapt and compensate to immune pressure, often without measurable fitness cost. 

A significant confounder in both human and viral genomic analyses is the existence of population 
stratification, where shared ancestry between infected individuals, stratification by ethnic groups, non- 
random distribution of HIV-1 subtypes, or clusters of viral transmission can all have an influence on the 
population frequencies of specific mutations, and thus create spurious associations if not carefully 
controlled for. Previous studies usually controlled for viral population substructure but were limited in 
the control of human population stratification [Moore et al., 2002; Bhattacharya et al., 2007). Our 
approach offers the opportunity to correct for both factors, thanks to the availability of extensive host 
and viral genomic information. 

The present sample size provided approximately 80% power to detect a common human variant 
(minor allele frequency of 10%) with an odds ratio of 4.2 in the genome-to-genome analysis (Study B) 
and a viral amino acid explaining approximately 4% of the variation in plasma viral load (Study C) at 
the respective significance thresholds (Purcell et al., 2003). Consistent with most studies performed 
in HIV-1 host genetics over the past few years (reviewed in Telenti and Johnson (2012)), we did not 
identify previously unknown host genetic loci involved in host-viral interaction and HIV-1 restriction. 
The proposed approach can only detect polymorphic host factors that leave an imprint on the virus, 
which may exclude mediators of immunopathogenesis or genes involved in the establishment of 
tolerance {Medzhitov et al., 2012). An additional limitation is the incomplete nature of genomic 
information available both on the host side (common genotypes from GWAS) and on the viral side 
(near full-length consensus sequence; gp120 was not included in the analyses). Finally, the multiple 
hypothesis burden of a genome-to-genome scan is extremely high. It is conceivable that larger studies, 
or studies that focus on a subgroup of predefined host genes, would have power to detect novel asso- 
ciations. A comprehensive, but computationally challenging description of host-pathogen genomic 
interactions would require human genome sequencing, coupled with deep sequencing of intra-host 
retroviral subpopulations. 

In summary, we used a genome-to-genome, hypothesis-free approach to identify associations 
between host polymorphisms and HIV-1 genomic variation. This strategy allows a global assessment 
of host-pathogen interactions at the genome level and reveals sites of genomic conflict. Comparable 
approaches are immediately applicable to explore other important infectious diseases, as long as poly- 
morphic host factors exert sufficient selective pressure to trigger escape mutations in the pathogen. 
The observation that pathogen sequence variation, used as an intermediate phenotype, is more 
powerful than clinical and laboratory outcomes to identify some host factors allows smaller-scale 
studies and encourages analyses of less prevalent infectious diseases. Researchers involved in pathogen 
genome studies and host genetic studies should strongly consider the gathering of paired host-pathogen 
data. 

Materials and methods 
Ethics statement 

Participating centers provided local Institutional Review Board approval for genetic analysis. Study 
participants provided informed consent for genetic testing, with the exception of a subset where a 
procedure approved by the relevant Research Ethics Board allowed the use of anonymized historical 
specimens in the absence of a specific informed consent. 
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Participants 

Study participants are treatment-na'fve individuals followed in one of the following cohorts or institu- 
tions: the Swiss HIV Cohort Study (SHCS, www.shcs.ch, [Schoeni-Affolter et ai, 2010]); the HAART 
Observational Medical Evaluation and Research (HOMER) study in Vancouver, Canada (www.cfenet. 
ubc.ca/our-work/initiatives/homer); the AIDS Clinical Trials Group (ACTG) Network in the USA 
(actgnetwork.org); the International HIV Controllers Study in Boston, USA (IHCS, www.hivcontrollers. 
org); Western Australian HIV Cohort Study, Perth, Australia; the AIDS Research Institute IrsiCaixa in 
Badalona, Spain; and the Instituto de Salud Carlos III in Madrid, Spain. To reduce noise due to host and 
viral diversity, we only included individuals of recent Western European ancestry (confirmed by clustering 
with HapMap CEU individuals in principal component analysis of the genotype data [Price et a/., 
2006]), and infected with HIV-1 subtype B (as assessed by the REGA Subtyping Tool [De Oliveira 
et a/., 2005]). Plasma VL determinations in the absence of antiretroviral therapy were available from 
patients from the SHCS and the HOMER study. The VL phenotype was defined as the average of 
the log 10 -transformed numbers of HIV-1 RNA copies per ml of plasma, excluding measurements 
obtained in the first 6 months after seroconversion and during advanced immunosuppression (i.e., 
with <100 CD4+ T cells per ml of blood). Consequently, 698 study participants were eligible for VL 
analysis. 

Human genotype data 

DNA samples were genotyped in the context of previous GWAS (Fellay et a/., 2009; Pereyra et a/., 
2010) or for the current study on various platforms, including the HumanHap550, Human 660W-Quad, 
HumanIM and HumanOmniExpress BeadChips (lllumina Inc., San Diego, CA, USA), as well as the 
Genome-Wide Human SNP Array 6.0 (Affymetrix Inc., Santa Clara, CA, USA) {Table 2). Study partici- 
pants were filtered on the basis of genotyping quality, a sex check, and cryptic relatedness. SNP 
quality control was performed separately for each dataset: SNPs were filtered on the basis of missing- 
ness (excluded if called in <99% of participants), minor allele frequency (excluded if <0.01), and 
marked deviation from Hardy-Weinberg equilibrium (excluded if p<0. 00005). Missing genotype impu- 
tation was performed with the Mach software per genotyping platform (in separate batches for lllumina 
1M, OmniExpress, 550K and Affymetrix data) using 1000 Genomes Phase I CEU population data as 
reference haplotypes. Imputed markers were filtered on minor allele frequency (excluded if <0.01) and 
imputation quality using Mach's reported r-squared measure (excluded if <0.3). SNPs with a deviation 

in the allele frequencies between platforms 
were excluded. High-resolution HLA class I typing 
(4 digits; HLA- A, HLA-B, and HLA-C) was obtained 
using sequence-based methods, or imputed from 
the SNP genotyping data as described elsewhere 
(J/a et ai, 2013). 

HIV-1 sequence data 

Near full-length retroviral sequence data were 
obtained by bulk sequencing of viral RNA present 
in pretreatment-stored plasma, and in 1 1 cases, of 
proviral DNA isolated from peripheral blood mono- 
nuclear cells, as previously described (Sandonis 
et a/., 2009; John et a/., 2010). We defined an 
amino acid residue as variable if at least 10 study 
samples presented an alternative allele. Per posi- 
tion, separate binary variables were generated 
for each alternate amino acid, indicating the 
presence or absence of that allele in a given 
sample. 

Association analyses 

To globally assess the association between human 
genomic variation (SNPs), HIV-1 proteomic varia- 
tion (amino acids) and clinical outcome (VL), we 



Table 2. Distribution of samples across 
genotyping platforms and cohorts 



N 


Genotyping platform 


Cohort 


140 


lllumina 1 M 


ACTG 


6 


lllumina OmniExpress 12v1 H 


CARLOS III 


518 


Affymetrix 6.0 


HOMER 


136 


lllumina OmniExpress12v1 H 


HOMER 


47 


lllumina 650k 


IHCS 


6 


lllumina 660W-Quad 


IRSICAIXA 


2 


lllumina 1 M 


SHCS 


79 


lllumina 550k 


SHCS 


122 


lllumina OmniExpress12v1 H 


SHCS 


15 


lllumina 550k 


WAHCS 



ACTG = AIDS Clinical Trials Group Network; CARLOS III = 
Instituto de Salud Carlos III; HOMER = HAART 
Observational Medical Evaluation and Research Study; 
IHCS = International HIV Controllers Study; IRSICAIXA = 
AIDS Research Institute IrsiCaixa; SHCS = Swiss HIV 
Cohort Study; WAHCS = Western Australian HIV Cohort 
Study. 

DOI: 10.7554/eLife.01 123.007 
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performed three series of analyses {Figure 1): [A] human SNPs vs VL; [B] human SNPs vs HIV-1 
amino acids; and [C] HIV-1 amino acids vs VL. To test for association between human SNPs and 
HIV-1 amino acids, we used phylogenetically corrected logistic regression (Carlson et al., 2008; 
Carlson et al., 2012). For association testing between polymorphic amino acids in human HLA 
genes and HIV sequence variation, we used standard logistic regression (for a binary HLA amino 
acid) or a multivariate omnibus test (when more than one alternate allele was present) including 
sex, cohort, and the coordinates of the first two principal component axes as covariates. We used 
linear regression models in PLINK to test for association between human SNPs and VL, and between 
HIV-1 amino acids and VL (Purcell et al., 2007), including sex, cohort, and the coordinates of the 
first two principal component axes as covariates (Price et al., 2006). An additive genetic model was 
used for all analyses involving human SNPs. Significance was assessed using Bonferroni correction 
(significance thresholds of 7.25 x 1 2.4 x 1 0r 12 , and 1.6x1 0 s for analyses A, B, and C, respectively, 
Figure 1). 
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